Genomes of the genus Streptococcus were downloaded using the following command:
ncbi-genome-download --parallel 8 --section refseq --assembly-level complete,chromosome --format fasta --genus "Streptococcus" bacteria
All genomes were cleaned for:
In total, 123 genomes of Streptococcus pneumoniae (Pneu) and 452 genomes of non-pneumoniae Streptococcus (Non Pneu) were prepared for KMC analysis.
A list of k-mer sizes was prepared for counting: 21,31,33,100-105,110,115,120,125,130,135,140,145,150,155,160,165,170,175,180,185,190,195,200,205,210,220,230,235,240,245,250,255
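The counting step over this list could be scripted roughly as follows. This is a minimal sketch, not the commands actually used: it assumes that "100-105" means the inclusive range 100 to 105, that `pneu_genomes.txt` is a hypothetical file listing one FASTA path per line, and that `-ci1` (keep k-mers seen only once) is wanted, since KMC drops them by default. The loop only prints the `kmc` commands so they can be checked before running.

```shell
# Expand a comma-separated list of k-mer sizes, one value per line.
# Assumption: an entry such as "100-105" is an inclusive range.
expand_k_list() {
    for item in $(printf '%s' "$1" | tr ',' ' '); do
        case "$item" in
            *-*)
                lo=${item%-*}
                hi=${item#*-}
                i=$lo
                while [ "$i" -le "$hi" ]; do
                    echo "$i"
                    i=$((i + 1))
                done
                ;;
            *)
                echo "$item"
                ;;
        esac
    done
}

K_LIST="21,31,33,100-105,110,115,120,125,130,135,140,145,150,155,160,165,170,175,180,185,190,195,200,205,210,220,230,235,240,245,250,255"

# Dry run: print one KMC command per k-mer size.
# pneu_genomes.txt, pneu_k<k> and kmc_tmp are hypothetical names.
# -fm: multi-FASTA input; -ci1: keep k-mers that occur only once.
for k in $(expand_k_list "$K_LIST"); do
    echo "kmc -k$k -fm -ci1 @pneu_genomes.txt pneu_k$k kmc_tmp"
done
```

Removing the `echo` would run the commands, provided `kmc` is installed and the `kmc_tmp` working directory exists. KMC caps counters at 255 by default, so `-cs` may need raising if depths above that matter for the 452 Non Pneu genomes.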
Green dots show the intersected k-mers of Streptococcus.
The Y-axis shows the frequency of k-mer depth (log scale).
The plot can be inspected interactively.
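Depth frequencies of this kind could be exported from a KMC database along the following lines. This is a sketch under assumptions: `pneu_k100` is a hypothetical database prefix, the plotted values are assumed to come from the `kmc_tools transform ... histogram` output (two columns: depth, number of k-mers), and the `add_log10` helper is illustrative only. The input fed to it here is made-up toy data, not results from this analysis.

```shell
# Dry run: print the histogram export command for one database (hypothetical prefix).
echo "kmc_tools transform pneu_k100 histogram pneu_k100.hist.txt"

# Add a log10(frequency) column to a depth/frequency histogram read from stdin,
# matching the log-scaled Y-axis. Rows with zero frequency are skipped.
add_log10() {
    awk '$2 > 0 { printf "%d\t%d\t%.3f\n", $1, $2, log($2) / log(10) }'
}

# Toy input (made-up numbers, for illustration only).
printf '1\t1000\n2\t100\n3\t10\n' | add_log10
```

On a real histogram file the helper would be used as `add_log10 < pneu_k100.hist.txt`.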
The plot shows a shift in the k-mer pattern between Pneu and Non Pneu. The short k-mers (i.e. 21, 31, 33) were used as a reference because they are commonly used in other k-mer-based software for short reads. For the Pneu k-mers, the frequency of k-mer depth along the X-axis changed steadily, following a smooth curve, until 120. The direction of the frequency change flipped between the 100s-mers and the 200s-mers at depth 67.
For selecting a suitable k-mer size, a 100-mer may be a good starting point, with some options:
Pneu sequences:
Non Pneu sequences: